Feedback should be sent to goran.milovanovic@datakolektiv.com.
These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.


### --- Setup - importing the libraries
# - suppress those annoying FutureWarnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# - data
import numpy as np
import pandas as pd
# - os
import os
# - ml
import statsmodels.api as sm
import statsmodels.formula.api as smf
# - visualization
import matplotlib.pyplot as plt
import seaborn as sns
# - parameters
%matplotlib inline
pd.options.mode.chained_assignment = None # default='warn'
sns.set_theme()
# - rng
rng = np.random.default_rng()
# - plots
plt.rc("figure", figsize=(8, 6))
plt.rc("font", size=14)
sns.set_theme(style='white')
# - directory tree
data_dir = os.path.join(os.getcwd(), '_data')
# - loading the dataset
# - Kaggle: https://www.kaggle.com/datasets/aungpyaeap/fish-market
# - place it in your _data/ directory
fish_data = pd.read_csv(os.path.join(data_dir, 'Fish.csv'))
fish_data.head(10)
|  | Species | Weight | Length1 | Length2 | Length3 | Height | Width |
|---|---|---|---|---|---|---|---|
| 0 | Bream | 242.0 | 23.2 | 25.4 | 30.0 | 11.5200 | 4.0200 |
| 1 | Bream | 290.0 | 24.0 | 26.3 | 31.2 | 12.4800 | 4.3056 |
| 2 | Bream | 340.0 | 23.9 | 26.5 | 31.1 | 12.3778 | 4.6961 |
| 3 | Bream | 363.0 | 26.3 | 29.0 | 33.5 | 12.7300 | 4.4555 |
| 4 | Bream | 430.0 | 26.5 | 29.0 | 34.0 | 12.4440 | 5.1340 |
| 5 | Bream | 450.0 | 26.8 | 29.7 | 34.7 | 13.6024 | 4.9274 |
| 6 | Bream | 500.0 | 26.8 | 29.7 | 34.5 | 14.1795 | 5.2785 |
| 7 | Bream | 390.0 | 27.6 | 30.0 | 35.0 | 12.6700 | 4.6900 |
| 8 | Bream | 450.0 | 27.6 | 30.0 | 35.1 | 14.0049 | 4.8438 |
| 9 | Bream | 500.0 | 28.5 | 30.7 | 36.2 | 14.2266 | 4.9594 |
fish_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Species  159 non-null    object 
 1   Weight   159 non-null    float64
 2   Length1  159 non-null    float64
 3   Length2  159 non-null    float64
 4   Length3  159 non-null    float64
 5   Height   159 non-null    float64
 6   Width    159 non-null    float64
dtypes: float64(6), object(1)
memory usage: 8.8+ KB
model_frame = fish_data[['Height', 'Weight']].copy()
model_frame
|  | Height | Weight |
|---|---|---|
| 0 | 11.5200 | 242.0 |
| 1 | 12.4800 | 290.0 |
| 2 | 12.3778 | 340.0 |
| 3 | 12.7300 | 363.0 |
| 4 | 12.4440 | 430.0 |
| ... | ... | ... |
| 154 | 2.0904 | 12.2 |
| 155 | 2.4300 | 13.4 |
| 156 | 2.2770 | 12.2 |
| 157 | 2.8728 | 19.7 |
| 158 | 2.9322 | 19.9 |
159 rows × 2 columns
# - fitting the linear model to the data
linear_model = smf.ols(formula='Weight ~ Height', data=model_frame).fit()
linear_model.summary()
| Dep. Variable: | Weight | R-squared: | 0.525 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.522 |
| Method: | Least Squares | F-statistic: | 173.3 |
| Date: | Thu, 20 Apr 2023 | Prob (F-statistic): | 3.84e-27 |
| Time: | 14:04:11 | Log-Likelihood: | -1101.0 |
| No. Observations: | 159 | AIC: | 2206. |
| Df Residuals: | 157 | BIC: | 2212. |
| Df Model: | 1 |  |  |
| Covariance Type: | nonrobust |  |  |
|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -144.3860 | 45.663 | -3.162 | 0.002 | -234.579 | -54.193 |
| Height | 60.4964 | 4.595 | 13.164 | 0.000 | 51.419 | 69.573 |
| Omnibus: | 95.711 | Durbin-Watson: | 0.279 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 482.586 |
| Skew: | 2.288 | Prob(JB): | 1.61e-105 |
| Kurtosis: | 10.204 | Cond. No. | 23.3 |
The linear model has the form

$$y = \beta_1 x + \beta_0 + \varepsilon,$$

where $\beta_1$ is the slope, $\beta_0$ is the intercept, and $\varepsilon$ is the error term.
The predicted value $\hat{y}$ of the target variable is computed via linear regression as

$$\hat{y} = \beta_1 x + \beta_0.$$

# - the model's parameters; the Height coefficient is the slope beta_1, Intercept is beta_0
linear_model.params
Intercept   -144.385971
Height        60.496351
dtype: float64
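Just to confirm the formula, we can compute $\hat{y}$ by hand from these parameters and compare against predict(); a minimal sanity check (not in the original notebook):

# - sanity check (a sketch): compute the predictions by hand from the fitted parameters
b0 = linear_model.params['Intercept']
b1 = linear_model.params['Height']
manual_preds = b0 + b1*model_frame['Height']
print(np.allclose(manual_preds, linear_model.predict()))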
# - predicting the value using model's formula and parameters
model_frame['Predicted Values'] = linear_model.predict()
# - calculating the residuals - the difference between the true and predicted values
model_frame['Residuals'] = linear_model.resid
model_frame
|  | Height | Weight | Predicted Values | Residuals |
|---|---|---|---|---|
| 0 | 11.5200 | 242.0 | 552.531995 | -310.531995 |
| 1 | 12.4800 | 290.0 | 610.608492 | -320.608492 |
| 2 | 12.3778 | 340.0 | 604.425765 | -264.425765 |
| 3 | 12.7300 | 363.0 | 625.732580 | -262.732580 |
| 4 | 12.4440 | 430.0 | 608.430623 | -178.430623 |
| ... | ... | ... | ... | ... |
| 154 | 2.0904 | 12.2 | -17.924399 | 30.124399 |
| 155 | 2.4300 | 13.4 | 2.620162 | 10.779838 |
| 156 | 2.2770 | 12.2 | -6.635779 | 18.835779 |
| 157 | 2.8728 | 19.7 | 29.407947 | -9.707947 |
| 158 | 2.9322 | 19.9 | 33.001430 | -13.101430 |
159 rows × 4 columns
# - some statistics on the residuals
model_frame['Residuals'].describe()
count    1.590000e+02
mean    -2.288037e-14
std      2.468036e+02
min     -3.572937e+02
25%     -1.165323e+02
50%     -7.076091e+01
75%      3.391263e+01
max      1.163621e+03
Name: Residuals, dtype: float64
# - plotting the true data, predicted values and the prediction line
sns.regplot(data=model_frame, x='Height', y='Weight', ci=None, line_kws={'color':'red'})
sns.scatterplot(data=model_frame, x='Height', y='Predicted Values', color='red', s=50)
sns.despine(offset=10, trim=True)
plt.title('Weight ~ Height', fontsize=14);
OK, statsmodels can do it; but how do we find the optimal values of $\beta_0$ and $\beta_1$ ourselves?
Let's build ourselves a function that (a) takes some particular values of $\beta_0$ and $\beta_1$ for a particular regression problem (i.e. for a particular dataset) and (b) returns the model error.
The model error? Oh. Remember the residuals:

$$\varepsilon_i = y_i - \hat{y}_i,$$

where $y_i$ is the observation to be predicted and $\hat{y}_i$ is the prediction?
Next we do something similar to what happens in the computation of variance and square the differences:

$$\varepsilon_i^2 = (y_i - \hat{y}_i)^2,$$

and then define the model error across all observations as the sum of squares:

$$SSE = \sum_{i=1}^{N}(y_i - \hat{y}_i)^2.$$

Obviously, the lower the $SSE$ - the Sum of Squared Errors - the better the model! Here's a function that returns the SSE for a given dataset (with two columns: the predictor and the criterion) and a choice of parameters $\beta_0$ and $\beta_1$:
# - sse function
def lg_sse(pars):
    # - pick up the parameters
    beta_0 = pars[0]
    beta_1 = pars[1]
    # - predict from parameters
    preds = beta_0 + beta_1*fish_data['Height']
    # - compute residuals
    residuals = fish_data['Weight'] - preds
    # - square the residuals
    residuals = residuals**2
    # - sum of squares
    residuals = residuals.sum()
    # - out:
    return residuals
Test lg_sse() now:
pars = [-144.385971, 60.496351]
print(lg_sse(pars))
9624100.07729549
Check via statsmodels:
(linear_model.resid**2).sum()
9624100.07729549
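As an aside, for simple linear regression the optimum also has a closed form (the normal equations): $\hat{\beta}_1 = \mathrm{cov}(x, y)/\mathrm{var}(x)$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$. A quick sketch (not in the original notebook), just to anchor what the search methods below should converge to:

# - closed-form OLS solution (a sketch, for reference)
x_h = fish_data['Height'].values
y_w = fish_data['Weight'].values
beta_1_hat = np.cov(x_h, y_w)[0, 1]/np.var(x_h, ddof=1)
beta_0_hat = y_w.mean() - beta_1_hat*x_h.mean()
print(beta_0_hat, beta_1_hat)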
Method A. Random parameter space search
beta_0 = np.random.uniform(low=-200, high=200, size=10000)
beta_1 = np.random.uniform(low=-200, high=200, size=10000)
random_pars = pd.DataFrame({'beta_0':beta_0, 'beta_1':beta_1})
random_pars.head()
|  | beta_0 | beta_1 |
|---|---|---|
| 0 | -144.844441 | 6.840789 |
| 1 | -161.372406 | -12.539866 |
| 2 | 82.661221 | 166.498917 |
| 3 | 74.559114 | -93.551042 |
| 4 | -98.924844 | -57.612271 |
sse = []
for i in range(random_pars.shape[0]):
    pars = [random_pars['beta_0'][i], random_pars['beta_1'][i]]
    sse.append(lg_sse(pars))
random_pars['sse'] = sse
random_pars.sort_values('sse', ascending=True, inplace=True)
random_pars.head()
|  | beta_0 | beta_1 | sse |
|---|---|---|---|
| 6018 | -139.299550 | 59.759600 | 9.626044e+06 |
| 7157 | -152.671358 | 61.688325 | 9.629146e+06 |
| 4790 | -133.125571 | 59.855602 | 9.630123e+06 |
| 3564 | -139.380501 | 59.391128 | 9.631478e+06 |
| 8613 | -153.179392 | 60.576304 | 9.634489e+06 |
Check with statsmodels:
linear_model.params
Intercept   -144.385971
Height        60.496351
dtype: float64
Not bad; how about 100,000 random pairs?
beta_0 = np.random.uniform(low=-200, high=200, size=100000)
beta_1 = np.random.uniform(low=-200, high=200, size=100000)
random_pars = pd.DataFrame({'beta_0':beta_0, 'beta_1':beta_1})
sse = []
for i in range(random_pars.shape[0]):
    pars = [random_pars['beta_0'][i], random_pars['beta_1'][i]]
    sse.append(lg_sse(pars))
random_pars['sse'] = sse
random_pars.sort_values('sse', ascending=True, inplace=True)
random_pars.head()
|  | beta_0 | beta_1 | sse |
|---|---|---|---|
| 42262 | -142.922147 | 60.198786 | 9.624588e+06 |
| 91647 | -139.732331 | 60.172505 | 9.624891e+06 |
| 27821 | -141.405253 | 60.013639 | 9.625066e+06 |
| 61892 | -147.718715 | 61.011229 | 9.625133e+06 |
| 57573 | -140.966401 | 60.407449 | 9.625216e+06 |
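Note that the Python loop above calls lg_sse() once per candidate pair, which gets slow at this scale; under the same definitions, the whole sweep can be done in one shot with NumPy broadcasting. A vectorized sketch (same math as lg_sse, assuming the random_pars frame from above):

# - vectorized SSE over all candidate pairs at once (a sketch)
h = fish_data['Height'].values
w = fish_data['Weight'].values
# - shape: (n_pairs, n_observations); chunk this if memory is tight
preds = random_pars['beta_0'].values[:, None] + random_pars['beta_1'].values[:, None]*h
sse_vec = ((w - preds)**2).sum(axis=1)
print(np.allclose(sse_vec, random_pars['sse'].values))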
Method B. Grid search
beta_0_vals = np.linspace(-200,200,100)
beta_1_vals = np.linspace(-200,200,100)
grid = np.array([(beta_0, beta_1) for beta_0 in beta_0_vals for beta_1 in beta_1_vals])
grid = pd.DataFrame(grid)
grid = grid.rename(columns={0: "beta_0", 1: "beta_1"})
grid.head()
|  | beta_0 | beta_1 |
|---|---|---|
| 0 | -200.0 | -200.000000 |
| 1 | -200.0 | -195.959596 |
| 2 | -200.0 | -191.919192 |
| 3 | -200.0 | -187.878788 |
| 4 | -200.0 | -183.838384 |
sse = []
for i in range(grid.shape[0]):
    pars = [grid['beta_0'][i], grid['beta_1'][i]]
    sse.append(lg_sse(pars))
grid['sse'] = sse
grid.sort_values('sse', ascending=True, inplace=True)
grid.head()
|  | beta_0 | beta_1 | sse |
|---|---|---|---|
| 1864 | -127.272727 | 58.585859 | 9.634695e+06 |
| 1964 | -123.232323 | 58.585859 | 9.637258e+06 |
| 965 | -163.636364 | 62.626263 | 9.637272e+06 |
| 1764 | -131.313131 | 58.585859 | 9.637324e+06 |
| 1065 | -159.595960 | 62.626263 | 9.639683e+06 |
A denser grid:
beta_0_vals = np.linspace(-200,200,1000)
beta_1_vals = np.linspace(-200,200,1000)
grid = np.array([(beta_0, beta_1) for beta_0 in beta_0_vals for beta_1 in beta_1_vals])
grid = pd.DataFrame(grid)
grid = grid.rename(columns={0: "beta_0", 1: "beta_1"})
sse = []
for i in range(grid.shape[0]):
    pars = [grid['beta_0'][i], grid['beta_1'][i]]
    sse.append(lg_sse(pars))
grid['sse'] = sse
grid.sort_values('sse', ascending=True, inplace=True)
grid.head()
|  | beta_0 | beta_1 | sse |
|---|---|---|---|
| 135651 | -145.945946 | 60.660661 | 9.624180e+06 |
| 136651 | -145.545546 | 60.660661 | 9.624194e+06 |
| 134651 | -146.346346 | 60.660661 | 9.624216e+06 |
| 137651 | -145.145145 | 60.660661 | 9.624260e+06 |
| 144650 | -142.342342 | 60.260260 | 9.624263e+06 |
Check with statsmodels:
linear_model.params
Intercept   -144.385971
Height        60.496351
dtype: float64
Method C. Optimization (the real thing)
The Method of Least Squares
import scipy as sp
# - sse function
def lg_sse(pars, data):
    # - pick up the parameters
    beta_0 = pars[0]
    beta_1 = pars[1]
    # - predict from parameters
    preds = beta_0 + beta_1*data['Height']
    # - compute residuals
    residuals = data['Weight'] - preds
    # - square the residuals
    residuals = residuals**2
    # - sum of squares
    residuals = residuals.sum()
    # - out:
    return residuals
# - initial (random) parameter values
init_beta_0 = np.random.uniform(low=-15, high=15)
init_beta_1 = np.random.uniform(low=-15, high=15)
init_pars = [init_beta_0, init_beta_1]
# - optimize w. Nelder-Mead
optimal_model = sp.optimize.minimize(
    # - fun(parameters, *args)
    fun=lg_sse,
    args=(fish_data,),
    x0=init_pars,
    method='Nelder-Mead')
# - optimal parameters
optimal_model.x
array([-144.38594971, 60.49634763])
Check against statsmodels:
linear_model.params
Intercept   -144.385971
Height        60.496351
dtype: float64
Final value of the objective function (the model SSE, indeed):
lg_sse(pars=optimal_model.x, data=fish_data)
9624100.077295545
Check against statsmodels:
linear_model.ssr
9624100.07729549
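For least-squares problems specifically, scipy also ships purpose-built routines; a sketch using scipy.optimize.curve_fit as an alternative (the helper lin() is ours, not part of the original):

# - alternative: scipy.optimize.curve_fit (a sketch)
def lin(x, beta_0, beta_1):
    return beta_0 + beta_1*x
popt, pcov = sp.optimize.curve_fit(lin, fish_data['Height'], fish_data['Weight'])
print(popt)                    # estimated [beta_0, beta_1]
print(np.sqrt(np.diag(pcov)))  # their standard errors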
Error Surface Plot: The Objective Function
beta_0_vals = np.linspace(-200,-100,300)
beta_1_vals = np.linspace(0,90,300)
grid = np.array([(beta_0, beta_1) for beta_0 in beta_0_vals for beta_1 in beta_1_vals])
grid = pd.DataFrame(grid)
grid = grid.rename(columns={0: "beta_0", 1: "beta_1"})
sse = []
for i in range(grid.shape[0]):
    pars = [grid['beta_0'][i], grid['beta_1'][i]]
    sse.append(lg_sse(pars, fish_data))
grid['sse'] = sse
grid.sort_values('sse', ascending=True, inplace=True)
grid.head()
|  | beta_0 | beta_1 | sse |
|---|---|---|---|
| 50001 | -144.481605 | 60.501672 | 9.624101e+06 |
| 50301 | -144.147157 | 60.501672 | 9.624113e+06 |
| 49701 | -144.816054 | 60.501672 | 9.624123e+06 |
| 50601 | -143.812709 | 60.501672 | 9.624161e+06 |
| 49401 | -145.150502 | 60.501672 | 9.624182e+06 |
This is the function that we have minimized:
# - import plotly
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"
# - Mesh3d: Objective Function
fig = go.Figure(data=[go.Mesh3d(
    x=grid['beta_0'],
    y=grid['beta_1'],
    z=grid['sse'],
    color='lightblue',
    opacity=0.50)])
fig.update_layout(scene=dict(
    xaxis_title='Beta_0',
    yaxis_title='Beta_1',
    zaxis_title='SSE'),
    width=700,
    margin=dict(r=20, b=10, l=10, t=10))
fig.show()
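If plotly is not available, a plain matplotlib contour plot of the same grid tells a similar story; a sketch (it undoes the sort above to restore the grid layout):

# - 2D contour view of the objective function (a matplotlib sketch)
Z = grid.sort_index()['sse'].values.reshape(len(beta_0_vals), len(beta_1_vals))
plt.contourf(beta_1_vals, beta_0_vals, Z, levels=30, cmap='viridis')
plt.colorbar(label='SSE')
plt.xlabel('Beta_1')
plt.ylabel('Beta_0')
plt.title('SSE over the parameter grid', fontsize=14);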
Back to statsmodels
# - Pearson's correlation (R-value) coefficient and R^2
print(f"Pearson's correlation (R-value): {round(np.sqrt(linear_model.rsquared), 4)}")
print(f"Coefficient of determination (R^2): {round(linear_model.rsquared, 4)}")
Pearson's correlation (R-value): 0.7243
Coefficient of determination (R^2): 0.5247
# - p-values of the model's parameters
print(f"p-values: \n{linear_model.pvalues}")
p-values: 
Intercept    1.881076e-03
Height       3.842342e-27
dtype: float64
# --- Predicting new data
predictions = pd.DataFrame(columns=['Height', 'Weight'])
# - sampling the new data from the normal distribution with the mean and std parameters taken from the original data
new_fish_height = rng.normal(loc=model_frame['Height'].mean(), scale=model_frame['Height'].std(), size=100)
# - clipping the negative values
new_fish_height = np.clip(new_fish_height, a_min=0, a_max=np.inf)
predictions['Height'] = new_fish_height
# - predicting the weights for the new data using the linear model
new_fish_weight = linear_model.predict(predictions['Height'])
# - displaying the new data and the corresponding predictions
predictions['Weight'] = new_fish_weight
print(predictions)
       Height      Weight
0    8.614782  376.776924
1    8.484943  368.922117
2    4.089841  103.034457
3    6.762779  264.737482
4    4.698844  139.876918
..        ...         ...
95   2.952395   34.223173
96   3.595383   73.121582
97  11.121632  528.432200
98  11.460855  548.953950
99  11.363379  543.057012

[100 rows x 2 columns]
# - plotting the predictions
sns.lineplot(data=predictions, x='Height', y='Weight', color='red')
sns.scatterplot(data=predictions, x='Height', y='Weight', color='red', s=50)
sns.despine(offset=10, trim=True)
plt.title('Predictions', fontsize=14);
# - R^2 unpacked: calculating R^2 from the corresponding sums of squares
total_sse = np.sum((model_frame['Weight'] - model_frame['Weight'].mean())**2)
model_sse = np.sum((model_frame['Predicted Values'] - model_frame['Weight'].mean())**2)
residual_sse = np.sum(model_frame['Residuals']**2)
print(np.isclose(total_sse, model_sse + residual_sse))
True
# - calculating R^2 from the correlation
r2 = model_frame['Height'].corr(model_frame['Weight'])**2
print(f"r2 = {r2}\n")
# - or
corr_matrix = np.corrcoef(model_frame['Height'], model_frame['Weight'])**2
print(corr_matrix, '\n')
r2 = corr_matrix[0, 1]
print(f"r2 = {r2}")
r2 = 0.5246761559328881

[[1.         0.52467616]
 [0.52467616 1.        ]] 

r2 = 0.5246761559328881
# - yet another way to calculate R^2
model_sse/total_sse
0.5246761559328881
# - correlation is the covariance of the scaled variables!
# - scaling the variables
x = (model_frame['Height'] - model_frame['Height'].mean())/model_frame['Height'].std()
y = (model_frame['Weight'] - model_frame['Weight'].mean())/model_frame['Weight'].std()
# - calculating the covariance
cov_matrix = np.cov(x, y)
cov = cov_matrix[0, 1]
print('cov.: ', cov)
print('R^2 = cov.^2: ', cov**2)
cov.:  0.7243453291993318
R^2 = cov.^2:  0.5246761559328884
# - plotting the distribution of the residuals
plt.figure(figsize=(8, 6))
sns.histplot(data=model_frame, x='Residuals', binwidth=10, color='tab:orange')
sns.despine(offset=10, trim=True)
plt.title('Model Residuals', fontsize=14);
# - plotting the qqplot
sm.qqplot(model_frame['Residuals'].values, line='q')
sns.despine(offset=10, trim=True)
plt.title('Q-Q Plot of Model Residuals', fontsize=14);
### --- Heteroskedasticity
# - plotting the predicted values vs the residuals
plt.figure(figsize=(8, 6))
sns.scatterplot(data=model_frame, x='Predicted Values', y='Residuals')
sns.despine(offset=10, trim=True)
plt.axhline(y=0, color='red')
plt.title('Predicted vs Residual', fontsize=14);
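The fan-shaped spread of the residuals suggests heteroskedasticity; statsmodels can also test this formally. A sketch using the Breusch-Pagan test (not part of the original notebook):

# - Breusch-Pagan test for heteroskedasticity (a sketch)
from statsmodels.stats.diagnostic import het_breuschpagan
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(linear_model.resid, linear_model.model.exog)
print(f"LM statistic: {lm_stat:.3f}, p-value: {lm_pval:.3g}")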
# - identifying the influential cases
model_inf = linear_model.get_influence()
inf_frame = model_inf.summary_frame()
inf_frame
|  | dfb_Intercept | dfb_Height | cooks_d | standard_resid | hat_diag | dffits_internal | student_resid | dffits |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.010916 | -0.059963 | 0.006823 | -1.259609 | 0.008528 | -0.116819 | -1.261984 | -0.117039 |
| 1 | 0.032398 | -0.085427 | 0.009019 | -1.301798 | 0.010531 | -0.134302 | -1.304707 | -0.134602 |
| 2 | 0.024818 | -0.068269 | 0.005990 | -1.073542 | 0.010288 | -0.109452 | -1.074067 | -0.109506 |
| 3 | 0.031016 | -0.074907 | 0.006424 | -1.067137 | 0.011157 | -0.113353 | -1.067612 | -0.113404 |
| 4 | 0.017523 | -0.046875 | 0.002770 | -0.724468 | 0.010445 | -0.074430 | -0.723367 | -0.074317 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 154 | 0.018540 | -0.015848 | 0.000175 | 0.123070 | 0.022599 | 0.018714 | 0.122683 | 0.018655 |
| 155 | 0.006371 | -0.005382 | 0.000021 | 0.044004 | 0.021029 | 0.006449 | 0.043864 | 0.006429 |
| 156 | 0.011339 | -0.009632 | 0.000066 | 0.076917 | 0.021726 | 0.011463 | 0.076673 | 0.011426 |
| 157 | -0.005431 | 0.004510 | 0.000015 | -0.039590 | 0.019101 | -0.005525 | -0.039464 | -0.005507 |
| 158 | -0.007274 | 0.006026 | 0.000027 | -0.053422 | 0.018852 | -0.007405 | -0.053252 | -0.007382 |
159 rows × 8 columns
### --- Influence Measures
# - calculating the Cook's distance /Cook and Weisberg (1982)/
# - consider values with Cook's distance greater than 1 to be problematic
w_cookD = np.argwhere(model_inf.cooks_distance[0] > 1)
w_cookD
array([], shape=(0, 1), dtype=int64)
# - Leverage: hat values
# - Average Leverage = (k+1)/n
# - k - num. of predictors
# - n - num. observations
# - Also termed: hat values, range: 0 - 1
# - Various criteria (twice the average leverage, three times the average leverage...)
k = 1
n = len(model_frame)
w_leverage = np.argwhere(model_inf.hat_matrix_diag > 2*(k+1)/n)
w_leverage
array([[27],
[28],
[29],
[30],
[31],
[32],
[33],
[34]])
# - influence plot
inf_plot_frame = pd.DataFrame(columns=['Residuals', 'Leverage', 'Cook Dist.'])
inf_plot_frame['Residuals'] = model_frame['Residuals']
inf_plot_frame['Leverage'] = model_inf.hat_matrix_diag
inf_plot_frame['Cook Dist.'] = model_inf.cooks_distance[0]
inf_plot_frame
|  | Residuals | Leverage | Cook Dist. |
|---|---|---|---|
| 0 | -310.531995 | 0.008528 | 0.006823 |
| 1 | -320.608492 | 0.010531 | 0.009019 |
| 2 | -264.425765 | 0.010288 | 0.005990 |
| 3 | -262.732580 | 0.011157 | 0.006424 |
| 4 | -178.430623 | 0.010445 | 0.002770 |
| ... | ... | ... | ... |
| 154 | 30.124399 | 0.022599 | 0.000175 |
| 155 | 10.779838 | 0.021029 | 0.000021 |
| 156 | 18.835779 | 0.021726 | 0.000066 |
| 157 | -9.707947 | 0.019101 | 0.000015 |
| 158 | -13.101430 | 0.018852 | 0.000027 |
159 rows × 3 columns
# - plotting the influence plot
plt.figure(figsize=(8, 6))
sns.scatterplot(data=inf_plot_frame, x='Leverage', y='Residuals', size='Cook Dist.', legend=None)
sns.despine(offset=10, trim=True)
plt.title("Influence Plot\n Size of the blob corresponds to Cook's distance", fontsize=14);
# - import scikit-learn
from sklearn.linear_model import LinearRegression
# - fitting the linear model to the data
X = model_frame['Height'].values.reshape(-1, 1)
y = model_frame['Weight'].values.reshape(-1, 1)
lin_reg = LinearRegression()
lin_reg.fit(X, y)
LinearRegression()
# - coefficients of the model
lin_reg.coef_, lin_reg.intercept_
(array([[60.49635121]]), array([-144.38597115]))
# - calculating R^2
lin_reg.score(X, y)
0.5246761559328882
# - predicting the weight using the model
X_pred = predictions['Height'].values.reshape(-1, 1)
y_pred = lin_reg.predict(X_pred)
y_pred[:10]
array([[ 376.77692446],
[ 368.92211676],
[ 103.03445733],
[ 264.73748233],
[ 139.87691798],
[ 629.30850633],
[ 269.15073697],
[1031.00519406],
[ 781.490085 ],
[ 281.01799317]])
# - comparing with the predictions from the statsmodels model
predictions['Weight']
0 376.776924
1 368.922117
2 103.034457
3 264.737482
4 139.876918
...
95 34.223173
96 73.121582
97 528.432200
98 548.953950
99 543.057012
Name: Weight, Length: 100, dtype: float64
First: the model parameters and their standard errors
# Get the estimated parameters and standard errors of the model
params = linear_model.params.values
stderr = linear_model.bse.values
# Print the results
print(f"Beta 0: {params[0]} (SE = {stderr[0]})")
print(f"Beta 1: {params[1]} (SE = {stderr[1]})")
Beta 0: -144.38597115478754 (SE = 45.66290399900891)
Beta 1: 60.49635121272138 (SE = 4.595454722325197)
Second: the standard deviation of model residuals
error_residual = np.std(linear_model.resid, ddof=1)
error_residual
246.80361789420277
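A caveat: np.std with ddof=1 divides by $n-1$, while the regression's residual standard error divides by the residual degrees of freedom $n-2$; the two are close here but not identical. For comparison (mse_resid is the residual mean square from statsmodels):

# - residual standard error with the regression df (n - 2)
print(np.sqrt(linear_model.mse_resid))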
Third: the Sim-Fit Loop, Parametric Bootstrap
# import normal distribution from scipy
from scipy.stats import norm
# Predicted values
model_predictions = linear_model.predict()
# Define the number of bootstrap samples
n_samples = 100000
# Initialize empty arrays to store bootstrap estimates of coefficients
beta_0_samples = np.zeros(n_samples)
beta_1_samples = np.zeros(n_samples)
# Loop over the number of samples
for i in range(n_samples):
    # Add normally distributed noise (sd = error_residual) to the model predictions
    bootstrap_predictions = model_predictions + norm.rvs(loc=0,
                                                         scale=error_residual,
                                                         size=model_predictions.size)
    resampled_data = pd.DataFrame({'Height': fish_data['Height'],
                                   'Weight': bootstrap_predictions})
    # Fit the model to the bootstrap_predictions
    model_resampled = smf.ols(formula='Weight ~ Height', data=resampled_data).fit()
    # Store the coefficients from the resampled data
    beta_0_samples[i] = model_resampled.params.Intercept
    beta_1_samples[i] = model_resampled.params.Height
# Calculate the bias and variance of the coefficients
beta_0_bias = np.mean(beta_0_samples) - linear_model.params.Intercept
beta_1_bias = np.mean(beta_1_samples) - linear_model.params.Height
beta_0_var = np.var(beta_0_samples)
beta_1_var = np.var(beta_1_samples)
# Print the results
print(f"Beta 0: bias = {beta_0_bias}, variance = {beta_0_var}")
print(f"Beta 1: bias = {beta_1_bias}, variance = {beta_1_var}")
Beta 0: bias = 0.1582529863985087, variance = 2066.160134292637
Beta 1: bias = -0.014426161790858316, variance = 20.870794060684904
Compare with the model parameter variances as estimated by the original linear regression:
# Print the results
print(f"Beta 0: variance = {stderr[0]**2}")
print(f"Beta 1: variance = {stderr[1]**2}")
Beta 0: variance = 2085.100801622704 Beta 1: variance = 21.118204104940954
License: [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt)

This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.